1. Review of the Posterior Summarization approach

i) Definitions

  • \(y \in \mathcal{Y}\), where \(\mathcal{Y} \subseteq \mathbb{R}^{d}\) with \(d \in \mathbb{Z}_{+}\).
  • Suppose \(y \sim f\), where \(f \in \mathcal{F}\).
  • \(f\) is an unknown density.
  • Our goal is to learn \(f\) from a set of \(n\) observations, \(y_1,y_2,\dots, y_n\) in \(\mathcal{Y}\).

ii) Bayesian model

  • We take a Bayesian approach in which we obtain:
    • The posterior \(\pi(f\mid y)\) of \(f\) under a prior \(p(f)\).
    • We choose \(p(f)\) to be flexible so that the model fits the data well.
    • The drawback is that a flexible \(p(f)\) can produce posteriors that are difficult to interpret.
    • \(f(\widetilde{y}\mid y)\) denotes the posterior predictive distribution, also written as \(f(y_{n+1}\mid y_1,\dots,y_n)\).
    • \(\widetilde{y}\) is an observation from \(f(\widetilde{y}\mid y)\).

iii) Decision-theoretic approach to posterior summarization

  • Define a surrogate density \(g\) in a class \(\mathcal{G}\):

    • \(g\) is preferably parametric and lower-dimensional.
    • \(g_{\gamma}\) is a surrogate density indexed by \(\gamma \in \Gamma\), where \(\Gamma\) is the class of actions.
  • In the Posterior Summary framework, we first obtain the optimal action that approximates the Bayesian model through \(f(\widetilde{y}\mid y)\).

  • This is done by minimizing the penalized expected loss: \[ \widehat{\gamma} = \arg\min_{\gamma \in \Gamma} E_{\widetilde{Y}}[ \ell(\widetilde{y}; \gamma; g_{\gamma}) \mid Y] + \lambda P(\gamma) \]

  • The model \(g_{\widehat{\gamma}}\), indexed by \(\widehat{\gamma}\), is the density point-estimate approximation of the model through \(f(\widetilde{y}\mid y)\).

  • This is similar to DSS and related approaches, where one would usually stop at this point.

  • What matters about this step, however, is that we can extract valuable information about the dimension of the surrogate.

  • Suppose \(K^* \in \mathbb{Z}_{+}\) is the dimension of \(\widehat{\gamma}\), defined on the class \(\Gamma^{K^*}\).

  • Fixing this dimension, we obtain the posterior around the point estimate, which is the posterior summary.

  • This is done again via a decision-theoretic approach, in which we obtain the posterior projection as \[ \gamma_{ps} = \arg\min_{\gamma \in \Gamma^{K^*}} \ell(\widetilde{y}; \gamma; g_{\gamma}) \]

  • In this case, \(\gamma_{ps}\) carries the posterior summarization distribution of the parameters of the surrogate model.

  • We can then obtain estimates from \(\gamma_{ps}\), such as the posterior summarization mean, as well as credible intervals.

  • It is important to notice that: \[ \arg\min_{\gamma \in \Gamma} E_{\widetilde{Y}}[ \ell(\widetilde{y}; \gamma; g_{\gamma}) \mid Y] \neq E\left[\arg\min_{\gamma \in \Gamma} \ell(\widetilde{y}; \gamma; g_{\gamma})\mid Y\right] \]
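This inequality can be checked numerically. Below is a minimal Python sketch (not part of the method's R code): with an Exponential(1) stand-in for the predictive draws and absolute-error loss in the role of \(\ell\), the argmin of the expected loss is the predictive median, while the expectation of the per-draw argmins is the predictive mean, and the two differ whenever the predictive is skewed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Skewed stand-in for the posterior predictive: Exponential(1) draws.
y = rng.exponential(scale=1.0, size=100_000)

# Under absolute-error loss, the minimizer of the *expected* loss is the
# median of the predictive draws (~log 2 here)...
gamma_hat = np.median(y)

# ...while the expectation of the per-draw minimizers (each draw's argmin
# of |y_tilde - gamma| is y_tilde itself) is the mean (~1 here) instead.
e_of_argmin = np.mean(y)
```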

2. Application to Bayesian Density Estimation Summaries

i) Obtaining density estimates

  • We apply the posterior summarization framework to Bayesian density estimation. The idea is to project a nonparametric or overparameterized model onto a lower-dimensional, interpretable model. Here we use the class of finite mixture models (FMM) as surrogates, since they have good approximation properties and are easy to interpret.

  • We let \(g_{\gamma}\) be a surrogate density defined as \[ g_{\gamma}(y) = g(y\mid \gamma) = \sum_{q = 1}^k\eta_q g_{q}(y\mid \gamma_q), \] where \(\sum_{q = 1}^k\eta_q = 1\), the \(g_q(\cdot\mid\gamma_q)\) are densities, and \(\gamma = (\eta_1,\dots,\eta_k,\gamma_1,\dots,\gamma_k)\).

  • We define the loss function as \(\ell(\widetilde{y}; \gamma; g_{\gamma}) = -\log g_{\gamma}(\widetilde{y})\).

  • This yields the optimal estimate \[ \widehat{\gamma} = \arg\min_{\gamma \in \Gamma} E_{\widetilde{Y}}[ -\log g_{\gamma}(\widetilde{y}) \mid Y]. \]

  • Let \(\widetilde{y}_1,\dots,\widetilde{y}_{\widetilde{n}} \sim f(\widetilde{y}\mid y)\) be a sample from the posterior predictive distribution. We can then approximate the optimization step above as

\[ \widehat{\gamma}^{k} = \arg\max_{\gamma \in \Gamma^{k}} \frac{1}{\widetilde{n}} \sum_{n=1}^{\widetilde{n}} \log \sum_{q=1}^{k} \eta_{q} g_{q} (\widetilde{y}_{n} \mid \gamma_{q}). \]

  • In this case, we replace the penalty on the loss function with a greedy search over the parametric classes \(\Gamma^k\), for \(k = 1,2,\dots,K_{\max}\), where \(K_{\max}\) is chosen to be large.

  • We solve this using the EM algorithm provided by the mclust package for \(k = 1,2,\dots,K_{\max}\).

  • The idea is to search over the projections of size \(k = 1,2,\dots,K_{\max}\) for the one that best represents the model through \(f(\widetilde{y}\mid y)\).
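The loop over \(k\) can be sketched outside R as well. The following Python sketch uses scikit-learn's `GaussianMixture` (EM) in place of the mclust fits, with a synthetic stand-in `y_tilde` for the posterior predictive sample; all names here are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Stand-in posterior predictive sample: two well-separated components.
y_tilde = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])
X = y_tilde.reshape(-1, 1)

K_max = 6
fits, avg_loglik = {}, {}
for k in range(1, K_max + 1):
    # EM fit of a k-component Gaussian mixture to the predictive draws.
    gm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
    fits[k] = gm
    # Average log-likelihood on the predictive sample: the Monte Carlo
    # objective being maximized over Gamma^k.
    avg_loglik[k] = gm.score(X)
```

The discrepancy criterion then decides how small \(k\) can be while staying close to the best of these fits.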

ii) Selecting the lower dimensional posterior surrogate

  • The challenge now is to search for the surrogate with the smallest dimension that still has a good fit in relation to the posterior predictive distribution.

  • To achieve this goal, we propose a function that measures the discrepancy between the surrogate and the posterior predictive density \[\begin{equation} \delta_{n}^{k}(f,\widehat{g}^{k}_{\gamma}) = \log \frac{\widehat{g}^{k}_{\gamma}(\widetilde{y}_n)}{f_{\widetilde{y}}(\widetilde{y}_n)},\quad n = 1,2,\dots,\widetilde{n} \end{equation}\]

  • A useful property of the discrepancy function is that its expectation is an approximation of the negative Kullback–Leibler divergence: \[\begin{equation} E_{\widetilde{Y}}\left[\delta^k_{n}(f,\widehat{g}^{k}_{\gamma})\mid Y\right] = -\int f_{\widetilde{y}}(\widetilde{y})\log\frac{f_{\widetilde{y}}(\widetilde{y})}{\widehat{g}^{k}_{\gamma}( \widetilde{y})}d{\widetilde{y}} = - \text{KL}\left(f_{\widetilde{y}}(\widetilde{y})\|\widehat{g}^k_{\gamma}(\widetilde{y})\right). \end{equation}\]

  • Based on this property, we select the smallest dimension for which the discrepancy function still satisfies

\[E_{\widetilde{Y}}\left[\delta^k_{n}(f,\widehat{g}^{k}_{\gamma})\mid Y\right] \approx 0\]

  • Models that deviate from zero, i.e., with \(E_{\widetilde{Y}}\left[\delta^k_{n}(f,\widehat{g}^{k}_{\gamma})\mid Y\right] < 0\), are discarded.
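As a numerical illustration of this selection rule (a Python sketch, not the authors' code): since \(f(\widetilde{y}\mid y)\) has no closed form, the richest surrogate \(\widehat{g}^{K_{\max}}\) stands in for the reference density here, and we pick the smallest \(k\) whose mean discrepancy stays within a tolerance of zero. The tolerance `tol` is an illustrative choice.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
y_tilde = np.concatenate([rng.normal(-3, 1, 1000), rng.normal(3, 1, 1000)])
X = y_tilde.reshape(-1, 1)

K_max = 6
fits = {k: GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
        for k in range(1, K_max + 1)}

# Reference log-density: the richest surrogate stands in for f(y~ | y).
ref_logdens = fits[K_max].score_samples(X)

# Mean discrepancy approximates -KL(f || g^k): it is <= 0 (up to Monte
# Carlo noise) and approaches 0 as the surrogate matches the reference.
mean_delta = {k: np.mean(fits[k].score_samples(X) - ref_logdens)
              for k in range(1, K_max + 1)}

tol = 0.05  # illustrative tolerance
K_star = min(k for k, d in mean_delta.items() if d > -tol)
```

For this bimodal example the one-component fit has a clearly negative mean discrepancy and is discarded, while two components already sit near zero.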

iii) Posterior summarization step

  • We have a baseline for how far the dimension may be reduced without compromising predictive accuracy of the surrogate compared to the original model.

  • The lower-dimensional posterior summary is now obtained by projecting the posterior using this dimension as a reference.

\[ \gamma_{ps} = \arg\min_{\gamma \in \Gamma^{K^*}} \ell(\widetilde{y}; \gamma; g_{\gamma}) \]

  • For a fixed \(K^*\), we obtain \(\gamma_{ps}\) from draws \(\widetilde{y}\sim f(\widetilde{y}\mid y)\) as follows.

  • For each posterior draw \(\theta^{(m)}\sim \pi(\theta\mid y)\), we obtain a posterior predictive sample of size \(H\):

    • \(\widetilde{y}_{1}^{(m)}, \widetilde{y}_{2}^{(m)}, \dots, \widetilde{y}_{H}^{(m)} \sim f(\widetilde{y}\mid \theta^{(m)})\)
    • We repeat this process for every \(\theta^{(m)}\), \(m = 1,2,\dots,M\), resulting in \[ \begin{matrix} \widetilde{y}_{1}^{(1)}, & \widetilde{y}_{2}^{(1)}, & \dots & \widetilde{y}_{H}^{(1)} \\ \widetilde{y}_{1}^{(2)}, & \widetilde{y}_{2}^{(2)}, & \dots & \widetilde{y}_{H}^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ \widetilde{y}_{1}^{(M)}, & \widetilde{y}_{2}^{(M)}, & \dots & \widetilde{y}_{H}^{(M)} \end{matrix} \]
  • We obtain a separate \(\gamma_{ps}^{(m)}\) for every \(m = 1,2,\dots,M\) as \[\gamma_{ps}^{(m)} = \arg \max_{\gamma \in \Gamma^{K^*}}\frac{1}{H} \sum_{h=1}^{H}\log \sum_{q=1}^{K^{*}} \eta_{q} g_{q} (\widetilde{y}_{h}^{(m)} \mid \gamma_{q})\]

  • From this process we obtain the sequence \(\gamma_{ps}^{(1)},\gamma_{ps}^{(2)}, \dots, \gamma_{ps}^{(M)}\) of posterior summaries with dimension \(K^*\), that approximate \(\gamma_{ps}\).

  • From this distribution we can get an estimate by averaging over \(\gamma_{ps}^{(m)}\).

  • We can also obtain a density estimate by averaging over the surrogate, resulting in \[ \bar{g}^{K^*}(y_i\mid \gamma_{ps}) = \frac{1}{M}\sum_{m = 1}^M g^{K^*}(y_i\mid \gamma_{ps}^{(m)}) \]

  • We obtain credible intervals for \(g^{K^*}(y_i\mid \gamma_{ps}^{(m)})\) from \(\gamma_{ps}^{(m)}\).
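The sampling scheme above can be sketched end-to-end in Python (a toy stand-in for the posterior over \(\theta\); `M`, `H`, and the grid are illustrative): for each posterior draw we simulate \(H\) predictive values, refit the \(K^*\)-component surrogate to get \(\gamma_{ps}^{(m)}\), then average the \(M\) surrogate densities and take pointwise quantiles for a credible band.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
K_star, M, H = 2, 50, 400

# Toy stand-in for posterior draws theta^(m): two component means with
# posterior jitter around (-3, 3).
thetas = rng.normal(loc=[-3.0, 3.0], scale=0.1, size=(M, 2))

y_grid = np.linspace(-8.0, 8.0, 161)
dens = np.empty((M, y_grid.size))

for m in range(M):
    # Posterior predictive sample of size H given theta^(m).
    comp = rng.integers(0, 2, size=H)
    y_m = rng.normal(thetas[m, comp], 1.0).reshape(-1, 1)
    # gamma_ps^(m): refit the K*-component surrogate to this sample.
    gm = GaussianMixture(n_components=K_star, n_init=3, random_state=0).fit(y_m)
    dens[m] = np.exp(gm.score_samples(y_grid.reshape(-1, 1)))

# Point estimate: average the M surrogate densities over the grid;
# pointwise quantiles give a credible band.
g_bar = dens.mean(axis=0)
band_lo, band_hi = np.quantile(dens, [0.025, 0.975], axis=0)
```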

3. Posterior summarization for clustering

i) Introduction

  • Assume \(K^*\) is the reduced dimension of the surrogate model obtained in the first phase of our approach. Starting from this point, we may either get a projection of the cluster estimates, with \(K^*\) as the maximum number of clusters, or obtain a projection of the posterior on the cluster estimates, resulting in uncertainty estimates for the cluster allocation.

  • One key feature of our method is that it allows different surrogate projections depending on the kind of clustering allocation procedure. This makes the procedure modular and hence more versatile than other methods. In this paper, we use two distinct cluster surrogate projections. The first is the maximum conditional probability given the optimal actions \(\widehat{\gamma}_1,\widehat{\gamma}_2,\dots,\widehat{\gamma}_{K^*}\). The second is the k-means distance, which in our case uses the posterior predictive sample to find optimal centroids from which to assign the data.

ii) Cluster allocation using conditional probability

  • In the first surrogate, we obtain the conditional probability surrogates \(\widehat{\eta}_{c}(y_i)\) as \[\widehat{\eta}_{c}(y_i) \propto \widehat{\eta}_{c}\, g (y_{i} \mid \widehat{\gamma}_{c}), \quad i = 1, 2, \dots, n, \quad c = 1,2,\dots,K^* \]

  • The cluster allocation estimates \(\widehat{c}_{i}\) are then obtained as \[ \widehat{c}_{i} = \arg\max_{c \in \{1,\dots,K^*\}}\widehat{\eta}_{c} (y_{i}),\quad i = 1,2,\dots,n \]

  • Under this projection choice, posterior summaries for the clustering allocation are obtained by conditioning on the actions \(\gamma_{ps}^{(m)}\), for \(m = 1,2,\dots,M\), resulting in \[ \eta_{c}^{(m)}(y_i) \propto \eta_{c}^{(m)} g (y_{i} \mid \gamma_{ps}^{(m)}), \quad i = 1, 2, \dots, n , \quad c = 1,2,\dots,K^* \] which in turn yields the posterior projection allocations \[ \widehat{c}^{(m)}_{i} = \arg\max_{c \in \{1,\dots,K^*\}} \eta_{c}^{(m)} (y_{i}),\quad i = 1, 2, \dots, n, \] an approximation of the posterior projection on the clustering allocation for observation \(y_i\). From this result, we can obtain uncertainty quantification for each allocation or the posterior average estimate.
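A minimal numerical sketch of this allocation rule (Python; the weights, means, and standard deviations below are made-up values for a fitted \(K^* = 2\) Gaussian surrogate, not output of the method):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical fitted surrogate parameters for K* = 2 components.
eta = np.array([0.4, 0.6])   # mixture weights eta_c
mu = np.array([-3.0, 3.0])   # component means
sd = np.array([1.0, 1.0])    # component standard deviations

y = np.array([-3.2, -0.1, 2.8])  # observations to allocate

# eta_c(y_i) proportional to eta_c * g(y_i | gamma_c), normalized over c.
w = eta * norm.pdf(y[:, None], loc=mu, scale=sd)  # shape (n, K*)
resp = w / w.sum(axis=1, keepdims=True)

c_hat = resp.argmax(axis=1)      # hard allocation c_hat_i
uncert = 1.0 - resp.max(axis=1)  # allocation uncertainty per observation
```

Repeating the same computation with each \(\gamma_{ps}^{(m)}\) gives \(M\) allocations per observation, whose variability quantifies allocation uncertainty.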

iii) K-means clustering allocation projection

  • In the k-means allocation projection, we use the posterior predictive sample \(\widetilde{y}_1,\widetilde{y}_2, \dots,\widetilde{y}_{\widetilde{n}}\) from the first part of our method to estimate centroids \(\gamma_c\) of dimension \(K^*\), from which we determine the cluster allocations. The estimation minimizes the loss function \[ \ell(\widetilde{y}; \gamma_c ) = \| \widetilde{y} - \gamma_c \|_{2}^{2}. \] Again we use the decision-theoretic approach to obtain the optimal centroids:

\[ (\widehat{\gamma}, \widehat{c}) = \arg \min_{\gamma,c}E_{\widetilde{Y}} \left[\sum_{n = 1}^{\widetilde{n}}\sum_{k = 1}^{K^*}\mathbb{1}\{c_n = k\} \|\widetilde{y}_n-\gamma_k\|^2\mid Y\right], \] where the \(\widehat{\gamma}_k\) are the optimal centroids.

  • Solving the previous minimization problem yields the optimal centroids \(\widehat{\gamma}_c\) and hence the optimal clustering \[ \widehat{c}(y_i) = \arg\min_{c \in \{1,2,\dots,K^*\}} \| y_i - \widehat{\gamma}_c \|_{2}^{2},\quad \text{for}\quad i = 1,2,\dots,n \]

  • We can again obtain the posterior summarization for the cluster allocation under this projection via \[ (\gamma^{(m)}, c^{(m)}) = \arg \min_{\gamma,c}E_{\widetilde{Y}} \left[\sum_{h = 1}^{H}\sum_{k = 1}^{K^*}\mathbb{1}\{c_h = k\} \|\widetilde{y}_{h}^{(m)}-\gamma_k\|^2\mid Y\right],\quad m = 1,2,\dots,M \]

  • From this we again obtain the posterior projection on the cluster allocation as \[ {c}^{(m)}(y_i) = \arg\min_{c \in \{1,2,\dots,K^*\}} \| y_i - \gamma_{c}^{(m)} \|_{2}^{2},\quad \text{for}\quad i = 1,2,\dots,n,\quad m = 1,2,\dots,M \] which yields uncertainty quantification for the cluster allocation.
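The k-means projection can be sketched as follows (Python, using scikit-learn's `KMeans`; the data and sizes are illustrative). One practical wrinkle not spelled out above is label switching across draws; in this one-dimensional sketch it is handled by sorting centroids, which a multivariate implementation would replace with a proper matching step.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
K_star, M, H = 2, 30, 300

y = np.array([-3.1, -2.7, 0.2, 2.9, 3.3])  # observations to allocate

alloc = np.empty((M, y.size), dtype=int)
for m in range(M):
    # Stand-in posterior predictive batch for draw m.
    y_m = np.concatenate([rng.normal(-3, 1, H // 2), rng.normal(3, 1, H // 2)])
    km = KMeans(n_clusters=K_star, n_init=5, random_state=0).fit(y_m.reshape(-1, 1))
    # Align labels across draws by sorting centroids (sufficient in 1-d).
    centers = np.sort(km.cluster_centers_.ravel())
    # Allocate each y_i to its nearest centroid.
    alloc[m] = np.abs(y[:, None] - centers[None, :]).argmin(axis=1)

# Per-point allocation frequencies quantify clustering uncertainty.
freq = np.stack([(alloc == k).mean(axis=0) for k in range(K_star)], axis=1)
```

Points near a centroid are allocated consistently across draws (frequency near 1), while points between clusters show genuine allocation uncertainty.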

4. Numerical Applications

# Loading files
setwd("/Users/hb23255/Dropbox/DSS_MIX/DSS_MIX_06_2024/")
source("source/dcpossum_dens_comp.R")
source("source/dcpossum_sim_data_mix.R")
source("source/dcpossum_plots.R")
source("source/dcpossum_unc.R")
source("source/func_pred_laplace_temp.R")
source("source/dcpossum_clust.R")
source("source/dcpossum_clust_compar.R")

4.1 Univariate Examples

i) Example 1 - five-component Gaussian mixture:

\[ 0.2 \times N(19,5) + 0.2 \times N(19,1) + 0.25\times N(23,1) + 0.2\times N(29,1) + 0.15\times N(33,2) \]

set.seed(1820) 
# y.data.1 = data_sim_func("example_02", n = 1000)
setwd("/Users/hb23255/Dropbox/DSS_MIX/DSS_MIX_06_2024/")
y.data.1 = read.csv("source/BP_source/sim_bp/bp_1000/data_bp_01_03.csv")
y.data.1 = y.data.1$x
MFM.1 = dcpossum.MFM(y.data.1,kmax = 10, quant.sample = 1000)
DPM.1 = dcpossum.DPM(y.data.1,kmax = 10, quant.sample = 1000)
BP.1 = dcpossum.BP(y.data.1, kmax = 10, BP.run = F, quant.sample = 1000, data.bp = "sim_1000", pred.f = FALSE)
# Plot - identify optimal components
comp.1 = plot.possum.comp.uni(BP.1,DPM.1,MFM.1, title = "", kmax = 10, y.data = y.data.1, sel.K = 5)
comp.1

# Plot - number of components from the methods
comp.2 = plot.postcomp.uni(DPM.1,MFM.1,BP.1,K.true = 5)
comp.2

possum.BP.1 = possum.unc.quant.values(BP.1, K.sel = 5)
possum.MFM.1 = possum.unc.quant.values(MFM.1, K.sel = 5)
possum.DPM.1 = possum.unc.quant.values(DPM.1, K.sel = 5)
plots.possum = plot.possum.unc.quant.dc(possum.BP.1,possum.MFM.1,possum.DPM.1, K.sel = 5, 
                                        y.data.1, index.possum = "example_02")
possum.p = plots.possum$dens.summ
possum.p

  • Below we report the Hellinger distance between the lower-dimensional estimate, the posterior summary estimate, and the true distribution, evaluating how these estimates approximate the truth for \(n\in\{100,250,1000\}\).

\[ H(f, g) = \frac{1}{\sqrt{2}} \left( \int \left( \sqrt{f(x)} - \sqrt{g(x)} \right)^2 \, dx \right)^{\frac{1}{2}} \]

(Figure: Hellinger distances between the estimates and the true distribution for \(n\in\{100,250,1000\}\).)
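The Hellinger distance above can be computed by simple numerical quadrature. A Python sketch (grid and densities are illustrative), checked against the closed form \(H^2 = 1 - \exp\{-(\mu_1-\mu_2)^2/(8\sigma^2)\}\) for two normals with common variance:

```python
import numpy as np
from scipy.stats import norm

def hellinger(f_vals, g_vals, dx):
    # H(f, g) = (1/sqrt(2)) * sqrt( integral (sqrt f - sqrt g)^2 dx )
    return np.sqrt(0.5 * np.sum((np.sqrt(f_vals) - np.sqrt(g_vals)) ** 2) * dx)

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

f = norm.pdf(x, loc=0.0, scale=1.0)
h_same = hellinger(f, f, dx)   # identical densities: distance 0

g = norm.pdf(x, loc=1.0, scale=1.0)
h_diff = hellinger(f, g, dx)   # densities shifted by one sd
```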

ii) Example 2 - Laplace Mixture Model:

\[ 0.4\times Laplace(-5,1.5) + 0.6\times Laplace(5,1) \]

set.seed(1820) 
y.data.lap = data_sim_func("examp_laplace", n = 600)
  • Plot of the BIC using mclust:
plot(mclustBIC(y.data.lap))

# Gaussian surrogate - mclust
temp.MFM = dcpossum.MFM(y.data.lap,kmax = 10, quant.sample = 1000)
temp.DPM = dcpossum.DPM(y.data.lap,kmax = 10, quant.sample = 1000)
temp.BP = dcpossum.BP(y.data.lap, kmax = 10, BP.run = F, quant.sample = 1000, data.bp = "examp_laplace")
# Comparison of different methods
comp.1 = possum.comparison.laplace(temp.BP,temp.MFM,temp.DPM)
comp.1

# Plot - number of components from the methods
comp.2 = plot.postcomp.uni(temp.DPM,temp.MFM,temp.BP,K.true = 2)
comp.2

# Generate possum for gaussian surrogate
possum.BP.1 = possum.unc.quant.values(temp.BP, K.sel = 2)
possum.MFM.1 = possum.unc.quant.values(temp.MFM, K.sel = 2)
possum.DPM.1 = possum.unc.quant.values(temp.DPM, K.sel = 2)
# Gaussian surrogate possum
plots.possum = plot.possum.unc.quant.dc(possum.BP.1,possum.MFM.1,possum.DPM.1, 
                                        K.sel = 2, y.data.lap, index.possum = "examp_laplace")

comp.4 = possum.laplace.pred.all(temp.BP,temp.MFM,temp.DPM, y.data.lap, scale.plot = c(-13,11), 
                                 K.sel = 2, kmax = 10, quant = c(0.025,0.975))

possum.dens = plots.possum$dens.summ

possum.dens

comp.4

  • The plot on the left displays the distribution of the posterior summary using a finite Gaussian mixture model with two components. On the right, we have the same, but using the Laplace finite mixture model surrogate.
comp.3 = possum.laplace.coef(temp.BP,temp.MFM,temp.DPM)

possum.par = plots.possum$param.sum
possum.par + comp.3

iii) Application - Galaxy data

set.seed(1820) 
y.data.app = data_sim_func("galaxy")

MFM.app = dcpossum.MFM(y.data.app,kmax = 10, quant.sample = 1000)
DPM.app = dcpossum.DPM(y.data.app,kmax = 10, quant.sample = 1000)
BP.app = dcpossum.BP(y.data.app, kmax = 10, BP.run = F, quant.sample = 1000, data.bp = "galaxy")

# Plot - identify optimal components
comp.1 = plot.possum.comp.uni(BP.app,DPM.app,MFM.app, title = "", kmax = 10, y.data = y.data.app)
comp.1

comp.2 = plot.postcomp.uni(DPM.app,MFM.app,BP.app,K.true = 0)
comp.2

possum.BP.1 = possum.unc.quant.values(BP.app, K.sel = 4)
possum.MFM.1 = possum.unc.quant.values(MFM.app, K.sel = 4)
possum.DPM.1 = possum.unc.quant.values(DPM.app, K.sel = 4)
# for gaussian surrogate
plots.possum = plot.possum.unc.quant.dc(possum.BP.1,possum.MFM.1,possum.DPM.1, K.sel = 4, 
                                        y.data = y.data.app, scale.plot = c(5.5,37), index.possum = FALSE)

possum.p = plots.possum$dens.summ

possum.p

4.2 Multivariate Examples

i) Example 1 - three-component Gaussian mixture in two dimensions

Consider the data distribution \(\sum_{i=1}^3 \eta_i N(\mu_i, C_i)\) where \[ \eta = (0.45, 0.3, 0.25), \quad \mu_1 = \begin{pmatrix} 4 \\ 4 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 7 \\ 4 \end{pmatrix}, \quad \mu_3 = \begin{pmatrix} 6 \\ 2 \end{pmatrix}, \] \[ C_1 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad C_2 = R \begin{pmatrix} 2.5 & 0 \\ 0 & 0.2 \end{pmatrix} R^T \quad \text{where} \quad R = \begin{pmatrix} \cos \rho & -\sin \rho \\ \sin \rho & \cos \rho \end{pmatrix} \quad \text{with} \quad \rho = \pi / 4, \] \[ \text{and} \quad C_3 = \begin{pmatrix} 3 & 0 \\ 0 & 0.1 \end{pmatrix}. \]
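For reference, this data-generating process can be simulated directly; the document's own `data_sim_func("example_mult_03_clust", ...)` plays this role in R, and the following is an equivalent Python sketch.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000

eta = np.array([0.45, 0.30, 0.25])                   # mixture weights
mu = np.array([[4.0, 4.0], [7.0, 4.0], [6.0, 2.0]])  # component means

rho = np.pi / 4
R = np.array([[np.cos(rho), -np.sin(rho)],
              [np.sin(rho),  np.cos(rho)]])

C = np.stack([np.eye(2),                       # C_1
              R @ np.diag([2.5, 0.2]) @ R.T,   # C_2 (rotated ellipse)
              np.diag([3.0, 0.1])])            # C_3

z = rng.choice(3, size=n, p=eta)               # latent component labels
y = np.stack([rng.multivariate_normal(mu[k], C[k]) for k in z])
```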

set.seed(1822)
y.data.exam.03 = data_sim_func("example_mult_03_clust", n = 1000)
# plot(y.data.2[,1],y.data.2[,2])

y.data.2 = y.data.exam.03[,c(1:2)]
y.clust = y.data.exam.03[,3]
# Working
# Change 95% quantile
temp.MFM.mult = dcpossum.MFM.mult(y.data.2, kmax = 10, quant.sample = 1000)

# Working
temp.DPM.mult = dcpossum.DPM.mult(y.data.2, kmax = 10, quant.sample = 1000)

# Working
temp.SFM.mult = dcpossum.SFM.mult(y.data.2, kmax = 10, p.dim = 2, col.data = 1:2, 
                                  clust.info = 0, quant.sample = 1000, 
                                  post.size = 1000)
# Change name of this plot - mult and uni
# change - tags in dcpossum - let insert tag order
# Plot discrepancy function given number of factors 
comp.estim = plot.estim.comp.mult(temp.MFM.mult,temp.DPM.mult,temp.SFM.mult, sel.K = 3, title = "Discrepancy function")
comp.estim

# change name of this plot
# change - tags in dcpossum - let insert tag order
# Plot number of factors 
comp.mult = plot.postcomp.mult(temp.DPM.mult,temp.MFM.mult,temp.SFM.mult, K.true = 3)
comp.mult

MFM.dens = temp.MFM.mult
DPM.dens = temp.DPM.mult
SFM.dens = temp.SFM.mult

plot.pred = plot_pred_dens_mult(SFM.dens,DPM.dens,MFM.dens, y.data.2, index = "example_02", sel.K = 3, plot_density = "simulation")
plot.pred

possum.SFM = possum.unc.quant.values.mult(temp.SFM.mult,K.sel = 3)
possum.DPM = possum.unc.quant.values.mult(temp.DPM.mult,K.sel = 3)
possum.MFM = possum.unc.quant.values.mult(temp.MFM.mult,K.sel = 3)

# add histogram 
# point estimate
plot.possum.marg = plot.possum.uncertain.quant.dc.mult(possum.SFM, possum.MFM, possum.DPM, 
                                                       K.sel = 3, y.data.2, scale.plot = FALSE,
                                                      index.possum = FALSE, quant = c(0.025,0.975))
plot.possum.marg

ii) Uncertainty quantification in clustering

possum_clust = dc.possum.clust(temp.SFM.mult, y.data.2, K.sel = 3, km = TRUE)

plot.unc.clust = ggplot(possum_clust, aes(x = dat.V1, y = dat.V2, color = clust.prob)) +
  geom_point(size = 0.6) +
  theme_light() +
  xlab("") +
  ylab("") +
  theme(legend.position="bottom") +
  scale_color_distiller(palette = "Spectral", direction = -1, limits = c(0, 0.75)) +
  ggtitle("Uncertainty - soft")

plot.unc.km = ggplot(possum_clust, aes(x = dat.V1, y = dat.V2, color = clust.prob.km)) +
  geom_point(size = 0.6) +
  theme_light() +
  xlab("") +
  ylab("") +
  theme(legend.position="bottom") +
  scale_color_distiller(palette = "Spectral", direction = -1, limits = c(0, 0.75)) +
  ggtitle("Uncertainty - K-means")

optim.clust = dc.optim.clust(temp.SFM.mult, y.data.2, K.sel = 3)

optim.clust.kmeans = dc.optim.clust.kmeans(temp.SFM.mult, y.data.2, K.sel = 3)

plot.optim = ggplot(optim.clust, aes(x = dat.V1, y = dat.V2, shape = as.factor(optim.clust), color = as.factor(optim.clust))) +
  geom_point(size = 0.6) +
  theme_light() +
  xlab("") +
  ylab("") +
  theme(legend.position="bottom") +
  ggtitle("Optimal - soft")

plot.optim.kmeans = ggplot(optim.clust.kmeans, aes(x = dat.V1, y = dat.V2, shape = as.factor(clust.km), color = as.factor(clust.km))) +
  geom_point(size = 0.6) +
  theme_light() +
  xlab("") +
  ylab("") +
  theme(legend.position="bottom") +
  ggtitle("Optimal - kmeans")

y.data.examp = as.data.frame(y.data.exam.03)

gb = GBclust.func(y.data.2, H = 3)

true_clust = ggplot(y.data.examp, aes(x = V1, y = V2, shape = as.factor(clust_assign), color = as.factor(clust_assign))) +
  geom_point(size = 0.6) +
  theme_light() +
  xlab("") +
  ylab("") +
  theme(legend.position="bottom") +
  ggtitle("True")

(true_clust + plot.optim + plot.optim.kmeans) / (plot.unc.clust + plot.unc.km + gb) 

iii) Application - Thyroid data - three known components and five dimensions

set.seed(1234)
y.data.thyroid = data_sim_func("thyroid")
y.data.clust = data_sim_func("thyroid_clust")

temp.MFM.mult = dcpossum.MFM.mult(y.data.thyroid, kmax = 10, quant.sample = 1000)

temp.DPM.mult = dcpossum.DPM.mult(y.data.thyroid, kmax = 10, quant.sample = 1000)

temp.SFM.mult = dcpossum.SFM.mult(y.data.thyroid, kmax = 10, p.dim = 5, col.data = 1:5, 
                                  clust.info = 0, quant.sample = 1000, post.size = 1000)
comp.estim = plot.estim.comp.mult(temp.MFM.mult,temp.DPM.mult,temp.SFM.mult, title = "Discrepancy function")
comp.estim

comp.model = plot.postcomp.mult(temp.MFM.mult,temp.DPM.mult,temp.SFM.mult, K.true = 3)
comp.model

iv) Clustering plots thyroid data

possum_clust = dc.possum.clust(temp.SFM.mult, y.data = y.data.thyroid, K.sel = 3, km = FALSE)

optim.clust = dc.optim.clust(temp.SFM.mult, y.data.thyroid, K.sel = 3)
# 
# optim.clust.kmeans = dc.optim.clust.kmeans(temp.SFM.mult, y.data.thyroid, K.sel = 3)

category_mapping <- c("Normal" = 1, "Hypo" = 3, "Hyper" = 2)

# Transforming the categories using mutate and recode
transformed_data <- thyroid %>%
  mutate(Category_Numeric = recode(Diagnosis, !!!category_mapping))

clust.sfm = unlist(temp.SFM.mult[[8]])
kmax.sfm = max(clust.sfm)
legend_title <- "Groups"

colors.temp <- c("#619CFF", "#F8766D", "#00BA38", "#F564E3", "#00BFC4","#B79F00", "#F564E3", "#FF64B0", "#00BF7D", "#A3A500")
col.sfm = colors.temp[1:kmax.sfm]

shapes.temp <- c(15, 16, 17, 3, 7, 8, 18, 4, 19, 20)
shape.sfm = shapes.temp[1:kmax.sfm]

p1 = ggplot(transformed_data) +
  geom_point(aes(x = RT3U, y = T4, shape = as.factor(Category_Numeric), col = as.factor(Category_Numeric)), size = 1.2) +
  theme_light() +
  theme(legend.position="bottom") +
  labs(title="True", x ="RT3U", y = "T4") + scale_shape_manual(legend_title, values = c(15, 16, 17)) +
  scale_colour_manual(legend_title, values=c("#619CFF", "#F8766D", "#00BA38"))

p2 = ggplot(optim.clust) +
  geom_point(aes(x = dat.RT3U,y = dat.T4, shape = as.factor(optim.clust), col = as.factor(optim.clust)), size = 1.2) +
  theme_light()  +
  theme(legend.position="bottom") +
  labs(title="Optimal clustering classification", x = "RT3U", y = "T4") + scale_shape_manual(legend_title, values = c(15, 16, 17)) +
  scale_colour_manual(legend_title, values=c("#619CFF", "#F8766D", "#00BA38"))

# p2 = ggplot(optim.clust.kmeans) +
#   geom_point(aes(x = dat.RT3U,y = dat.T4, shape = as.factor(optim.clust), col = as.factor(optim.clust)), size = 1) +
#   theme_light()  +
#   theme(legend.position="bottom") +
#   labs(title="Optimal Clustering classification -", x = "RT3U", y = "T4") + scale_shape_manual(legend_title, values = c(15, 16, 17)) +
#   scale_colour_manual(legend_title, values=c("#619CFF", "#F8766D", "#00BA38"))

p3 = ggplot(possum_clust) +
  geom_point(aes(x = dat.RT3U,y = dat.T4, col = clust.prob), size = 1.2) +
  scale_color_distiller(palette = "Spectral", direction = -1, limits = c(0, max(possum_clust$clust.prob))) +
  theme_light() +
  theme(legend.position="bottom") +
  labs(title = "Possum uncertainty estimate", x = "RT3U", y = "T4")

p4 = ggplot(optim.clust) +
  geom_point(aes(x = dat.RT3U,y = dat.T4, shape = as.factor(clust.sfm), col = as.factor(clust.sfm)), size = 1.2) +
  theme_light()  +
  theme(legend.position="bottom") +
  labs(title="SFM classification", x = "RT3U", y = "T4") + scale_shape_manual(legend_title, values = shape.sfm) +
  scale_colour_manual(legend_title, values = col.sfm)


(p1 + p2) / (p3 + p4)

v) Rand index and error rate

true_groups = transformed_data[,7]
table(true_groups, optim.clust$optim.clust)
##            
## true_groups   1   2   3
##           1   2 146   2
##           2  35   0   0
##           3   0   4  26
adjustedRandIndex(true_groups, optim.clust$optim.clust)
## [1] 0.8779971
classError(optim.clust$optim.clust, true_groups)$errorRate
## [1] 0.0372093
table(true_groups, clust.sfm)
##            clust.sfm
## true_groups   1   2   3   4
##           1   1 147   0   2
##           2  34   1   0   0
##           3   0   4  15  11
adjustedRandIndex(true_groups, clust.sfm)
## [1] 0.8654924
classError(clust.sfm, true_groups)$errorRate
## [1] 0.08837209